The next step in the understanding of the genome organization, after the determination 
of complete sequences, involves proteomics. The proteome includes the whole set of 
protein-protein interactions, and two recent independent studies have shown that its 
topology displays a number of surprising features shared by other complex networks, 
both natural and artificial. In order to understand the origins of this topology and 
its evolutionary implications, we present a simple model of proteome evolution that 
is able to reproduce many of the observed statistical regularities reported from the 
analysis of the yeast proteome. Our results suggest that the observed patterns can be 
explained by a process of gene duplication and diversification that would evolve proteome 
networks under a selection pressure favoring robustness against failure of their individual 
components. 
1. Introduction 

The genome is one of the most fascinating examples of the importance of emergence 
from network interactions. The recent sequencing of the human genome [123 
revealed some unexpected features and confirmed that "the sequence is only the 
first level of understanding of the genome" [38]. The next fundamental step beyond 
the determination of the genome sequence involves the study of the properties of 
the proteins the genes encode, as well as their interactions ]l2| ]. Protein interactions 
play a key role at many different levels, and their failure can lead to cell malfunction 
or even apoptosis, in some cases triggering neoplastic transformation. This is the 
case, for example, of the feedback loop between two well-known proteins, MDM2 
and p53: in some types of cancer, amplification of the first (an oncoprotein) leads 
to the inactivation of p53, a tumor-suppressor gene that is central in the control of 
the cell cycle and death |f47j . 

Understanding the specific details of protein-protein interactions is an essential 
part of our understanding of the proteome, but a complementary approach is 
provided by the observation that network-like effects also play a key role. Using 
p53 again as an example, this gene is actually involved in a large number of interaction 
pathways dealing with cell signaling, the maintenance of genetic stability, or 
the induction of cellular differentiation [39]. The failure of p53, like the breakdown of a 
highly connected node in the Internet [1], has severe consequences. 

Additional insight is provided by the observation that in many cases the total 
suppression of a given gene in a given organism leads to a small phenotypic effect 
or even no effect at all [32]. These observations support the idea that, although 
some genes might play a key role and their suppression is lethal, many others can be 
replaced in their function by some redundancy implicit in the network of interacting 
proteins. 

Protein-protein interaction maps have been studied, at different levels, in a 
variety of organisms including viruses P,p^,p5[ , prokaryotes pj] , yeast [18], and 
multicellular organisms such as C. elegans W4l . Most previous studies have used the 
so-called two-hybrid assay Jlij based on the properties of site-specific transcriptional 
activators. Although differences exist between different two-hybrid projects |l(| the 
statistical patterns used in our study are robust. 

Recent studies have revealed a surprising result: the protein-protein interaction 
networks in the yeast Saccharomyces cerevisiae share some universal features with 
other complex networks |$5|. These studies actually offer the first global view of 
the proteome map. These are very heterogeneous networks: the probability P(k) 
that a given protein interacts with k other proteins is given by a power law, i.e. 
$P(k) \sim k^{-\gamma}$ with $\gamma \approx 2.5$ (see figure 1), with a sharp cut-off for large k. This 
distribution is thus very different from the Poissonian shape expected from a simple 
(Erdos-Renyi) random graph [p|p2]]. Additionally, these maps also display the so- 
called small-world (SW) effect: they are highly clustered (i.e. each node has a well- 
defined neighborhood of "close" nodes) but the minimum distance between any two 
randomly chosen nodes in the graph is short, a characteristic feature of random 
graphs p5[ . 

As shown in previous studies |jj, this type of network is extremely robust against 
random node removal but also very fragile when removal is performed selectively 
on the most connected nodes. SW networks appear to be present in a wide range 
of systems, including artificial ones p|,p ,[l0[|29[ ] and also in neural networks Gjjj34|, 
metabolic pathways |],|2(^,^3| (see also |2q|), even in human language organization 
[||. The implications of these topologies are also enormous for our understanding 
of epidemics. 

The experimental observations on the proteome map can be summarized as 
follows: 

(1) The proteome map is a sparse graph, with a small average number of links 
per protein. In [42] an average connectivity $K \sim 1.9$ to $2.3$ was reported 
for the proteome map of S. cerevisiae. This observation is also consistent 
with the study of the global organization of the E. coli gene network from 
available information on transcriptional regulation |3q|. 

Fig. 1. Degree distributions for two different data sets from the yeast proteome: A: Ref. [42]; 
B: Ref. jlSj. Both distributions display scaling behavior in their degree distribution P(k), i.e. 
$P(k) \sim k^{-\gamma}$, a sharp cut-off for large k, and very small average connectivities: $K_A = 1.83$ (total 
graph) and $K_B = 2.3$ (giant component), respectively. The slopes are $\gamma_A \approx 2.5 \pm 0.15$ and 
$\gamma_B \approx 2.4 \pm 0.21$. 



(2) It exhibits a SW pattern, different from the properties displayed by purely 
random (Poissonian) graphs. 

(3) The degree distribution of links follows a power law with a well-defined 
cut-off. To be more precise, Jeong et al. [19] reported a functional form for 
the degree distribution of S. cerevisiae, 

$$P(k) \simeq (k_0 + k)^{-\gamma} e^{-k/k_c}. \qquad (1.1)$$

A best fit of the real data to this form yields a degree exponent $\gamma \approx 2.5$ 
and a cut-off $k_c \approx 20$. This could have adaptive significance as a source of 
robustness against mutations. 
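
To make the functional form of Eq. (1.1) concrete, the following minimal Python sketch evaluates a normalized version of it on a finite range of degrees. The function name, the default offset `k_0 = 1.0`, and the choice of starting at k = 1 are illustrative assumptions; the text above only fixes the exponent (about 2.5) and the cut-off (about 20).

```python
import math


def truncated_power_law(k_max, gamma=2.5, k_c=20.0, k_0=1.0):
    """Return normalized P(k) for k = 1..k_max with the shape of Eq. (1.1).

    k_0 is a hypothetical default; only gamma and k_c are quoted in the text.
    """
    weights = [(k_0 + k) ** (-gamma) * math.exp(-k / k_c)
               for k in range(1, k_max + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]


if __name__ == "__main__":
    p = truncated_power_law(100)
    for k in (1, 5, 10, 20, 40):
        print(k, p[k - 1])
```

Plotting `p` on doubly logarithmic axes would show the approximately straight power-law regime followed by the exponential fall-off beyond the cut-off.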

In this paper we present a model of proteome evolution aimed at capturing the 
main properties exhibited by protein networks. The basic ingredients of the model 
are gene duplication plus rewiring of the protein interactions, two elements known 
to be the essential driving forces in genome evolution ^7|. The model does not 
include functionality or dynamics of the proteins involved, but it is a topologically- 
based approximation to the overall features of the proteome graph and intends to 
capture some of the generic features of proteome evolution. 

During the completion of this work we became aware of a paper by Vazquez et 
al., Ref. [37], in which a related model of proteome evolution, showing multifractal 
connectivity properties, is described and analyzed. 



2. Proteome growth model 

Here we restrict our rules to single-gene duplications, which occur in most cases due 
to unequal crossover [27], plus rewiring. Multiple duplications should be considered 
in future extensions of these models: molecular evidence shows that even 
whole-genome duplications have actually occurred in S. cerevisiae jl6| (see also Ref. jl0| ). 

Fig. 2. Growing network by duplication of nodes. First (a) duplication occurs after randomly 
selecting a node (arrow). The links from the newly created node (white) now can experience 
deletion (b) and new links can be created (c); these events occur with probabilities $\delta$ and $\alpha$, 
respectively. 

Rewiring has also been used in dynamical models of the evolution of robustness in 
complex organisms H. 

It is worth mentioning that the study of metabolic networks provides some 
support to the rule of preferential attachment [Q as a candidate mechanism to ex- 
plain the origins of the scale-free topology. Scale-free graphs are easily obtained in 
a growing network provided that the links to new nodes are made preferentially 
from nodes that already have many links. A direct consequence is that vertices 
with many connections are those that have been incorporated early. This seems 
to be plausible in the early history of metabolic nets, and this view is supported 
by some available evidence ^3|. A similar argument can be made with proteome 
maps, since there are strong connections between the evolution of metabolic path- 
ways and genome evolution, and other scenarios have also been proposed, including 
optimization jLl[ . Here we do not consider preferential attachment rules, although 
future studies should explore the possible contributions of different mechanisms to 
the evolution of network biocomplexity. In this context, new integrated analyses of 
cellular pathways using microarrays and quantitative proteomics [17] will help to 
obtain a more detailed picture of how these networks are organized. 

The proteome graph at any given step t (i.e. after t duplications) will be indicated 
as $\Omega_p(t)$. The rules of the model, summarized in figure 2, are implemented as 
follows. Each time step: (a) one node in the graph is randomly chosen and duplicated; 
(b) the links emerging from the newly generated node are removed with probability 
$\delta$; (c) finally, new links (not previously present) can be created between the 
new node and all the rest of the nodes with probability $\alpha$. Step (a) implements gene 
duplication, in which both the original and the replicated proteins retain the same 
structural properties and, consequently, the same set of interactions. The rewiring 
steps (b) and (c) implement the possible mutations of the replicated gene, which 
translate into the deletion and addition of interactions, with different probabilities. 
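
The three rules can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the results reported here: the function name `grow_proteome` and the list-of-sets graph representation are hypothetical, "not previously present" is read as "not among the links inherited from the duplicated node", and the addition probability is taken as beta / N, the scaling the mean-field analysis later imposes. The seed graph is the connected ring of five nodes mentioned for the simulations.

```python
import random


def grow_proteome(n_final, delta, beta, n_seed=5, rng=None):
    """Grow a graph by node duplication plus link deletion and addition.

    Returns a list of sets: adj[i] holds the neighbours of node i.
    """
    rng = rng or random.Random()
    # Seed graph: a connected ring of n_seed nodes.
    adj = [{(i - 1) % n_seed, (i + 1) % n_seed} for i in range(n_seed)]
    while len(adj) < n_final:
        n = len(adj)
        alpha = beta / n  # assumed scaling of the addition probability
        parent = rng.randrange(n)
        # (a) duplication: the new node inherits all links of the parent.
        inherited = set(adj[parent])
        # (b) each inherited link is deleted with probability delta.
        kept = {j for j in inherited if rng.random() >= delta}
        # (c) links to nodes that were not inherited are added with
        #     probability alpha (this includes the parent itself).
        added = {j for j in range(n)
                 if j not in inherited and rng.random() < alpha}
        links = kept | added
        new = n
        adj.append(links)
        for j in links:
            adj[j].add(new)
    return adj


if __name__ == "__main__":
    graph = grow_proteome(1000, delta=0.58, beta=0.16, rng=random.Random(0))
    mean_k = sum(len(nbrs) for nbrs in graph) / len(graph)
    print("average connectivity:", mean_k)
```

Keeping the graph as neighbour sets makes both the duplication step and the symmetric update of the partners of the new node one-liners, at the cost of an O(N) scan in step (c).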

Since we have two free parameters, we should first constrain their possible values 
by using the available empirical data. As a first step, we can estimate the asymptotic 
average connectivity exhibited by the model in a mean-field approximation (see also 
Ref. f57j). Let us indicate by $K_N$ the average connectivity of the system when it 
is composed of N nodes. It is not difficult to see that the increase in the average 
connectivity after one iteration step of the model is proportional to 

$$\frac{dK_N}{dN} \simeq K_{N+1} - K_N = \frac{1}{N}\left[K_N - 2\delta K_N + 2\alpha (N - K_N)\right]. \qquad (2.1)$$

The first term accounts for the duplication of one node, the second represents the 
average elimination of $\delta K_N$ links emanating from the new node, and the last term 
represents the addition of $\alpha (N - K_N)$ new connections pointing to the new node. 



Eq. (2.1) is a linear equation that is easily solved, yielding 

$$K_N = \frac{\alpha N}{\alpha + \delta} + \left(K_1 - \frac{\alpha}{\alpha + \delta}\right) N^{\Gamma}, \qquad (2.2)$$

where $\Gamma = 1 - 2\alpha - 2\delta$ and $K_1$ is the initial average connectivity of the system. 
This solution leads to an increasing connectivity through time. In order to have a 
finite K in the limit of large N, we must impose the condition $\alpha = \beta/N$, where 
$\beta$ is a constant. That is, the rate of addition of new links (the establishment of 
new viable interactions between proteins) is inversely proportional to the network 
size, and thus much smaller than the deletion rate $\delta$, in agreement with the rates 
observed in jl^]. In this case, for large N, we get 

$$\frac{dK_N}{dN} \simeq \frac{1}{N}(1 - 2\delta) K_N + \frac{2\beta}{N}. \qquad (2.3)$$

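The solution of this equation is 

$$K_N = \frac{2\beta}{2\delta - 1} + \left(K_1 - \frac{2\beta}{2\delta - 1}\right) N^{1 - 2\delta}. \qquad (2.4)$$

For $\delta > 1/2$ a finite connectivity is reached, 

$$K \equiv K_\infty = \frac{2\beta}{2\delta - 1}. \qquad (2.5)$$

The previous expression imposes the boundary condition $\delta > 1/2$, necessary in 
order to obtain a well-defined limiting average connectivity. Eq. (2.5), together 
with the experimental estimates of $K \sim 1.9$ to $2.3$, allows us to set a first restriction on 
the parameters $\beta$ and $\delta$. Imposing $K = 2$, we are led to the relation 

$$\beta = 2\delta - 1. \qquad (2.6)$$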
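
The mean-field argument can be checked numerically. The sketch below iterates the large-N recursion of Eq. (2.3), written as a difference equation, and compares the result with the limiting value of Eq. (2.5). The function names and the seed values (five nodes with connectivity 2, as for a ring) are illustrative assumptions.

```python
def mean_field_connectivity(n_final, delta, beta, k_seed=2.0, n_seed=5):
    """Iterate K_{N+1} = K_N + [(1 - 2*delta) * K_N + 2*beta] / N."""
    k = k_seed
    for n in range(n_seed, n_final):
        k += ((1.0 - 2.0 * delta) * k + 2.0 * beta) / n
    return k


def limiting_connectivity(delta, beta):
    """Limiting average connectivity 2*beta / (2*delta - 1), for delta > 1/2."""
    if delta <= 0.5:
        raise ValueError("a finite limit requires delta > 1/2")
    return 2.0 * beta / (2.0 * delta - 1.0)


if __name__ == "__main__":
    delta, beta = 0.58, 0.16  # satisfies beta = 2*delta - 1
    print("limit:", limiting_connectivity(delta, beta))
    for k_seed in (2.0, 4.0):
        print(k_seed, mean_field_connectivity(1000, delta, beta, k_seed=k_seed))
```

Because the transient decays only as a power of N with exponent 1 - 2δ, a seed connectivity different from the limiting value is forgotten slowly when δ is close to 1/2, which is worth keeping in mind when comparing finite networks with the asymptotic estimate.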

Moreover, estimates of the addition and deletion rates $\alpha$ and $\delta$ from yeast [[l2| give 
a ratio $\alpha/\delta < 10^{-3}$. For proteomes of size $N \sim 10^3$, as in the case of the yeast, 

Fig. 3. A) An example of a small proteome interaction map (giant component, $\Omega_\infty$) generated 
by the model with $N = 10^3$, $\delta = 0.58$, and $\beta = 0.16$. B) Real yeast proteome map obtained from 
the MIPS database. We can observe the close similarity between the real map and the output 
of the model. 

this leads to $\beta/\delta < 10^{-3} N \sim 1$. Using the safe approximation $\beta/\delta \approx 0.25$, together 
with the constraint (2.5), we obtain the approximate values 

$$\delta = 0.58, \qquad \beta = 0.16, \qquad (2.7)$$

which will be used through the rest of the paper. 

Simulations of the model start from a connected ring of N = 5 nodes, and 
proceed by iterating the rules until the desired network size is achieved. 

3. Results 

Computer simulations of the proposed model reproduce many of the regularities 
observed in the real proteome data. As an example of the output of the model, 
in figure 3A we show the giant component $\Omega_\infty$ (the largest cluster 
of connected proteins) of a realization of the model with $N = 10^3$ nodes. This 
figure clearly resembles the giant component of real yeast networks, as we can see 
by comparing with figure 3B, and we can appreciate the presence of a few highly 
connected hubs plus many nodes with a relatively small number of connections. 
The size of the giant component for $N = 10^3$, averaged over $10^4$ networks, is 
$|\Omega_\infty| = 472 \pm 87$, in good agreement with Wagner's data, $|\Omega_\infty| = 466$, for a yeast with 
a similar total number of proteins (the high variance in our result is due to the 
large fluctuations in the model for such a small network size N). On the other hand, 
in figure 4 we plot the connectivity distribution P(k) obtained for networks of size $N = 10^3$. 
In this figure we observe that the resulting connectivity distribution can be fitted 

a Figure kindly provided by W. Basalaj (see http://www.cl.cam.uk~wb204/GD99/#Mewes). 

Fig. 4. Degree distribution P(k) for the model, averaged over $10^4$ networks of size $N = 10^3$. 
The distribution shows a characteristic power-law behavior, with exponent $\gamma = 2.5 \pm 0.1$ and an 
exponential cut-off $k_c \simeq 28$. 

to a power law with an exponential cut-off, of the form given by Eq. (1.1), with 
parameters $\gamma = 2.5 \pm 0.1$ and $k_c \simeq 28$, in good agreement with the measurements 
reported in Refs flf] and gf]. 

An additional observation from Wagner's study of the yeast proteome is the 
presence of SW properties. We have also found similar topological features in our 
model, using the considered set of parameters. The proteome graph is defined by 
a pair $\Omega_p = (W_p, E_p)$, where $W_p = \{p_i\}$, $(i = 1, \ldots, N)$ is the set of N proteins and 
$E_p = \{\{p_i, p_j\}\}$ is the set of edges/connections between proteins. The adjacency 
matrix $\xi_{ij}$ indicates that an interaction exists between proteins $p_i, p_j \in \Omega_p$ ($\xi_{ij} = 1$) 
or that the interaction is absent ($\xi_{ij} = 0$). Two connected proteins are thus called 
adjacent and the degree of a given protein is the number of edges that connect it 
with other proteins. 
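
With the graph stored as a list of neighbour sets (an assumed representation, equivalent to the adjacency matrix), the degree of each protein, the giant component, and the average connectivity over the whole graph or over the giant component can be computed as in the following minimal sketch. The function names are illustrative.

```python
from collections import deque


def degrees(adj):
    """Degree of every node: the number of edges attached to it."""
    return [len(nbrs) for nbrs in adj]


def giant_component(adj):
    """Return the set of nodes in the largest connected cluster."""
    seen = set()
    best = set()
    for start in range(len(adj)):
        if start in seen:
            continue
        # Breadth-first search collects the cluster containing `start`.
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def mean_connectivity(adj, nodes=None):
    """Average degree over `nodes` (default: the whole graph)."""
    nodes = range(len(adj)) if nodes is None else nodes
    nodes = list(nodes)
    return sum(len(adj[i]) for i in nodes) / len(nodes)


if __name__ == "__main__":
    # Toy graph: a path 0-1-2, a pair 3-4, and an isolated node 5.
    toy = [{1}, {0, 2}, {1}, {4}, {3}, set()]
    giant = giant_component(toy)
    print(sorted(giant), mean_connectivity(toy), mean_connectivity(toy, giant))
```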

The SW pattern can be detected from the analysis of two basic statistical quantities: 
the clustering coefficient $C_v$ and the average path length L. Let us consider 
the adjacency matrix and indicate by $\Gamma_i = \{p_j \mid \xi_{ij} = 1\}$ the set of nearest neighbors 
of a protein $p_i \in W_p$. The clustering coefficient for this protein is defined as the 
number of connections between the proteins $p_j \in \Gamma_i$ [fl5f . Denoting 

$$\mathcal{L}_i = \sum_{j=1}^{N} \xi_{ij} \left[ \sum_{k \in \Gamma_i;\, j < k} \xi_{jk} \right], \qquad (3.1)$$

we define the clustering coefficient of the i-th protein as 

$$c_v(i) = \frac{2 \mathcal{L}_i}{k_i (k_i - 1)}, \qquad (3.2)$$

Table 1. Comparison between the observed regularities in the 
yeast proteome [42], the model predictions with $N = 10^3$, 
$\delta = 0.58$ and $\beta = 0.16$, and a random network with the same 
size and average connectivity as the model. The quantities X 
represent averages over the whole graph; $X^g$ represent averages 
over the giant component. 

                      Yeast proteome         Network model          Random network
$K$                   1.83                   2.2 ± 0.5              2.00 ± 0.06
$K^g$                 2.3                    4.3 ± 0.5              2.41 ± 0.05
$\gamma$              2.5                    2.5 ± 0.1              -
$|\Omega_\infty|$     466                    472 ± 87               795 ± 22
$C_v^g$               $2.2 \times 10^{-2}$   $1.0 \times 10^{-2}$   $1.5 \times 10^{-3}$
$L^g$                 7.14                   5.1 ± 0.5              9.0 ± 0.4


where $k_i$ is the connectivity of the i-th protein. The clustering coefficient is defined 
as the average of $c_v(i)$ over all the proteins, 

$$C_v = \frac{1}{N} \sum_{i=1}^{N} c_v(i), \qquad (3.3)$$

and it provides a measure of the average fraction of pairs of neighbors of a node 
that are also neighbors of each other. 

The average path length L is defined as follows: given two proteins $p_i, p_j \in W_p$, 
let $L_{\min}(i,j)$ be the minimum path length connecting these two proteins in $\Omega_p$. 
The average path length L will be 

$$L = \frac{2}{N(N-1)} \sum_{i<j} L_{\min}(i,j). \qquad (3.4)$$

Random graphs, where nodes are randomly connected with a given probability p 
M, have a clustering coefficient inversely proportional to the network size, 
$C_v^{rand} \approx K/N$, and an average path length proportional to the logarithm of the network 
size, $L^{rand} \approx \log N / \log K$. At the other extreme, regular lattices with only 
nearest-neighbor connections among units are typically clustered and exhibit long average 
paths. Graphs with SW structure are characterized by a high clustering with 
$C_v \gg C_v^{rand}$, while possessing an average path comparable with a random graph with the 
same connectivity and number of nodes. 
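
The two small-world indicators can be computed directly from the definitions above. The following Python sketch assumes the graph is given as a list of neighbour sets; the function names are illustrative, nodes with fewer than two neighbours are assigned a clustering of zero, and unreachable pairs are skipped in the path average, so that on a connected graph (such as the giant component) the result coincides with Eq. (3.4). The last function simply evaluates the random-graph estimates quoted in the text.

```python
import math
from collections import deque


def clustering_coefficient(adj):
    """Average of c_v(i) over all nodes (Eqs. 3.1 to 3.3)."""
    total = 0.0
    for nbrs in adj:
        k = len(nbrs)
        if k < 2:
            continue  # assumption: c_v(i) = 0 for fewer than two neighbours
        # Count each link between two neighbours of the node once (j < m).
        links = sum(1 for j in nbrs for m in adj[j] if j < m and m in nbrs)
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all connected ordered pairs."""
    total = 0
    pairs = 0
    for source in range(len(adj)):
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    if pairs == 0:
        raise ValueError("graph has no connected pair of nodes")
    return total / pairs


def random_graph_estimates(n, k):
    """Random-graph estimates: clustering K/N and path length log N / log K."""
    return k / n, math.log(n) / math.log(k)


if __name__ == "__main__":
    triangle = [{1, 2}, {0, 2}, {0, 1}]
    print(clustering_coefficient(triangle), average_path_length(triangle))
    print(random_graph_estimates(1000, 2.0))
```

Comparing the measured pair of values with `random_graph_estimates` for the same size and connectivity is the test for SW structure described above: clustering well above the random estimate together with a comparable path length.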

In Table 1 we report the values of K, $\gamma$, $|\Omega_\infty|$, $C_v$, and L for our model, 
compared with the values reported for the yeast S. cerevisiae [19, 42], and the values 
corresponding to a random graph with size and connectivity comparable with both 
the model and the real data. Except for the average connectivity of the giant 
component, which is slightly larger for the model, all the quantities for the model 
compare quite well with the values measured for the yeast. On the other hand, the 
values obtained for a random graph support the conjecture of the SW properties 
of the protein network put forward in Ref. [42]. 
4. Discussion 

The analysis of complex biological networks in terms of random graphs is not new. 
Early work suggested that the understanding of some general principles of genome 
organization might be the result of emergent properties within random networks 
of interacting units pl| , p2[ . An important difference emerges, however, from the 
new results about highly heterogeneous networks: the topological organization of 
metabolic and protein graphs is very different from the one expected under totally 
random wiring and as a result of their heterogeneity, new qualitative phenomena 
emerge (such as the robustness against mutation). This supports the view that cellular 
functions are carried out by networks made up of many species of interacting 
molecules and that networks of interactions might be at least as important as the 
units themselves p5| , p3[ . 

Our study has shown that the macroscopic features exhibited by the proteome 
are also present in our simple model. This is surprising, since it is obvious that 
different proteins and protein interactions play different roles and operate under 
very different time scales and our model lacks such specific properties, dynamics 
or explicit functionality. Using estimated rates of addition and deletion of protein 
interactions as well as the average connectivity of the yeast proteome, we accurately 
reproduce the available statistical regularities exhibited by the real proteome. In 
this context, although data from yeast might involve several sources of bias, it has 
been shown that the same type of distribution is observable in other organisms, 
such as the protein interaction map of the human gastric pathogen Helicobacter 
pylori or in the p53 network (Jeong and Barabasi, personal communication). 

These results suggest that the global organization of protein interaction maps 
can be explained by means of a simple process of gene duplication plus diversifica- 
tion. These are indeed the mechanisms known to be operating in genome evolution 
(although the magnitude of the duplication event can be different). One impor- 
tant point to be explored by further extensions of this model is the origin of the 
specific parameters used. The use of evolutionary algorithms and optimization pro- 
cedures might provide a consistent explanation of the particular values observed 
and their relevance in terms of functionality. A different source of validation of 
our model might be the study of proteome maps resulting from the evolution of 
resident genomes || : the genomes of endosymbionts and cellular organelles display 
an evolutionary degradation that somehow describes an inverse rule of proteome reduction. 
Reductive evolution can be almost extreme, and available data on resident 
proteomes might help to understand how proteome maps get simplified under the 
environmental conditions defined by the host genome. If highly connected nodes 
play a relevant role here, perhaps resident genomes shrink by losing weakly connected 
nodes first. 

Most of the classic literature within this area deals with the phylogenetic consequences 
of duplication and does not consider the underlying dynamics of interactions 
between genes. We can see, however, that the final topology has nontrivial consequences: 
this type of scale-free network will display an extraordinary robustness 
against random removal of nodes and thus it can have a selective role. But an 
open question arises: is the scale-free organization observed in real proteomes a 
byproduct of the pattern of duplication plus rewiring (perhaps under a low-cost 
constraint in wiring), so that we have "robustness for free"? The alternative is 
of course a fine-tuning of the process in which selection for robustness has been 
obtained by accepting or rejecting single changes. Further model approximations 
and molecular data might provide answers to these fundamental questions. 

R. V. Sole, R. Pastor-Satorras, E. Smith, and T. Kepler 